Reflections amid the DeepSeek Wave: Learning Everything Top-Down

戴政奕

This title is borrowed from a book:

Computer Networking: A Top-Down Approach

Benchmark performance

(Figure: benchmark results)

What the benchmarks tell us about the problems DeepSeek can handle

Some examples

  • AIME 2024: Let \(x,y\) and \(z\) be positive real numbers that satisfy the following system of equations: \(\log_2\left({x \over yz}\right) = {1 \over 2}\) \(\log_2\left({y \over xz}\right) = {1 \over 3}\) \(\log_2\left({z \over xy}\right) = {1 \over 4}\) Then the value of \(\left|\log_2(x^4y^3z^2)\right|\) is \(\frac{m}{n}\) where \(m\) and \(n\) are relatively prime positive integers. Find \(m+n\). (A worked sketch follows this list.)

  • Codeforces: Capitalization is writing a word with its first letter as a capital letter. Your task is to capitalize the given word. Note, that during capitalization all the letters except the first one remains unchanged. (A minimal solution sketch follows this list.)

  • GPQA Diamond: Two quantum states with energies E1 and E2 have a lifetime of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they can be clearly resolved?

  • MATH-500: Convert the point \((0,3)\) in rectangular coordinates to polar coordinates. Enter your answer in the form \((r,\theta),\) where \(r > 0\) and \(0 \le \theta < 2 \pi.\)

  • MMLU: Find the degree for the given field extension \(Q(\sqrt{2}, \sqrt{3}, \sqrt{18})\) over \(Q\).

  • SWE-bench Verified:

(Figure: a SWE-bench Verified example)
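
To get a feel for the level of difficulty, here is a worked sketch of the AIME item above. Writing \(a = \log_2 x\), \(b = \log_2 y\), \(c = \log_2 z\), the system becomes

\[ a - b - c = \tfrac{1}{2}, \qquad b - a - c = \tfrac{1}{3}, \qquad c - a - b = \tfrac{1}{4}. \]

Summing the three equations gives \(a + b + c = -\tfrac{13}{12}\); adding this identity to each equation in turn yields \(2a = \tfrac{1}{2} - \tfrac{13}{12}\), \(2b = \tfrac{1}{3} - \tfrac{13}{12}\), \(2c = \tfrac{1}{4} - \tfrac{13}{12}\), so \(a = -\tfrac{7}{24}\), \(b = -\tfrac{3}{8}\), \(c = -\tfrac{5}{12}\). Therefore

\[ \left|\log_2(x^4 y^3 z^2)\right| = \left|4a + 3b + 2c\right| = \left|\tfrac{-28 - 27 - 20}{24}\right| = \tfrac{25}{8}, \]

and \(m + n = 25 + 8 = 33\).

The Codeforces item, by contrast, is close to a warm-up exercise. A minimal solution sketch in Python (assuming the word arrives on standard input) could look like this:

    # Capitalize only the first letter; every other letter stays unchanged.
    word = input().strip()
    print(word[0].upper() + word[1:])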

Basic mathematical reasoning, basic programming, a degree of debugging ability, and a broad base of knowledge.

What does top-down mean?

  • Quickly distill the key points (a top-down approach)

Since GPT-4 arrived in 2023, every winter and summer break I have been able to write and maintain some personal projects with the help of large language models.

Without large language models, I could not have set up and deployed these things in such a short time: Docker, web server configuration files, proxies…

  • Freed from the details
    • Implementing a homemade operating system
      • All the tedious details that come with an operating system
      • BIOS / UEFI Specification
      • CPU manuals (IA-32, AMD64)
      • Toolchain (assembler, C compiler, GNU Make…)
      • Debugging (Bochs, QEMU, GDB…)
      • …

When you have an idea, a large language model can take over the countless details involved in turning that idea into reality.

  • Other uses
    • Adding detailed comments to source code
    • Translating Japanese comments
    • Polishing English writing

Born a human being, I am sorry

  • Unable to read tens of thousands of documents at the blistering speed of a large language model
  • Unable to remember every detail precisely; always either forgetting or fighting against forgetting

"A gentleman is no different from others by nature; he is simply good at making use of things." (Xunzi)

Why learn "anything"?

  • Ask the right questions and improve quickly
  • Ask boldly, verify carefully (trial and error)
  • Try to make possible things that once seemed nearly impossible

For computer science students, I personally believe there are only two core competencies: mathematics and programming. With a good idea but without good code to realize it, others have little reason to endorse your idea.

Abstract (DeepSeek V3)

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
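
The abstract's headline numbers (671B total parameters, 37B activated per token) come from sparse expert routing: each token is sent to only a few experts, so only those experts' parameters take part in that token's forward pass. Below is a deliberately tiny, hypothetical top-k routing sketch in Python with NumPy; it is not DeepSeek's implementation and leaves out MLA, the auxiliary-loss-free load balancing, and multi-token prediction entirely.

    import numpy as np

    def moe_layer(x, gate_w, experts, k=2):
        """Toy Mixture-of-Experts layer: route one token vector x to its top-k experts.

        x       : (d,) token representation
        gate_w  : (d, n_experts) router weights
        experts : list of (W, b) pairs, each a small feed-forward "expert"
        Only the k selected experts run, which is why the parameters activated
        per token are far fewer than the total parameter count.
        """
        logits = x @ gate_w                          # router score for every expert
        top = np.argsort(logits)[-k:]                # indices of the k best-scoring experts
        weights = np.exp(logits[top] - logits[top].max())
        weights /= weights.sum()                     # softmax over the selected experts only
        out = np.zeros_like(x)
        for w, idx in zip(weights, top):
            W, b = experts[idx]
            out += w * np.tanh(x @ W + b)            # run only the chosen experts
        return out

    # Tiny demo: 8 experts of width 16, 2 active per token.
    rng = np.random.default_rng(0)
    d, n_experts = 16, 8
    gate_w = rng.normal(size=(d, n_experts))
    experts = [(rng.normal(size=(d, d)), rng.normal(size=d)) for _ in range(n_experts)]
    print(moe_layer(rng.normal(size=d), gate_w, experts).shape)   # -> (16,)

The same routing idea, applied at vastly larger scale and with the load across experts balanced without an auxiliary loss, is what lets a 671B-parameter model activate only 37B parameters per token.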

Abstract (DeepSeek R1)

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.
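
What "reinforcement learning without supervised fine-tuning as a preliminary step" means mechanically is that the model improves from reward signals on its own sampled outputs rather than from labeled demonstrations. The toy sketch below shows only that core loop (the classic policy-gradient / REINFORCE update on a three-armed bandit); DeepSeek-R1-Zero's actual RL setup over sampled reasoning traces is far larger and more elaborate, and none of its specifics appear here.

    import numpy as np

    # Toy "learning from reward alone": a softmax policy over 3 actions is nudged
    # toward whichever action earns above-baseline reward. No labeled answers are used.
    rng = np.random.default_rng(0)
    true_reward = np.array([0.2, 0.5, 0.8])   # hidden expected reward of each action
    logits = np.zeros(3)                      # policy parameters
    baseline, lr = 0.0, 0.1

    for step in range(2000):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                           # softmax policy
        a = rng.choice(3, p=probs)                     # sample an action
        r = float(rng.random() < true_reward[a])       # stochastic 0/1 reward
        baseline += 0.01 * (r - baseline)              # running-average baseline
        grad = -probs
        grad[a] += 1.0                                 # d log pi(a) / d logits
        logits += lr * (r - baseline) * grad           # policy-gradient update

    print(np.round(probs, 3))   # probability mass concentrates on the best action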

Some familiar things

Summary

  • Ask questions, then ask follow-up questions
  • Write code hands-on with the help of large language models, turning everyday ideas into reality
  • Accumulate foundations (mathematics, classic texts)

And above all, learn everything with a top-down approach.